AITopics | dependency chain

Collaborating Authors

dependency chain

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

AMLA: MUL by ADD in FlashAttention Rescaling

Liao, Qichen, Hu, Chengqiu, Miao, Fangzheng, Li, Bao, Liu, Yiyang, Lyu, Junlong, Jiang, Lirui, Wang, Jun, Zheng, Lingchao, Li, Jun, Fan, Yuwei

arXiv.org Artificial IntelligenceOct-23-2025

Multi-head Latent Attention (MLA) significantly reduces KVCache memory usage in Large Language Models while introducing substantial computational overhead and intermediate variable expansion. This poses challenges for efficient hardware implementation -- especially during the decode phase. This paper introduces Ascend MLA (AMLA), a high-performance kernel specifically optimized for Huawei's Ascend NPUs. AMLA is built on two core innovations: (1) A novel FlashAttention-based algorithm that replaces floating-point multiplications with integer additions for output block rescaling, leveraging binary correspondence between FP32 and INT32 representations; (2) A Preload Pipeline strategy with hierarchical tiling that maximizes FLOPS utilization: the Preload Pipeline achieves Cube-bound performance, while hierarchical tiling overlaps data movement and computation within the Cube core. Experiments show that on Ascend 910 NPUs (integrated in CloudMatrix384), AMLA achieves up to 614 TFLOPS, reaching 86.8% of the theoretical maximum FLOPS, outperforming the state-of-the-art open-source FlashMLA implementation, whose FLOPS utilization is up to 66.7% on NVIDIA H800 SXM5. The AMLA kernel has been integrated into Huawei's CANN and will be released soon.

dependency chain, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2509.25224

Country: North America > United States (0.04)

Genre: Research Report (0.40)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Rina: Enhancing Ring-AllReduce with In-network Aggregation in Distributed Model Training

Chen, Zixuan, Liu, Xuandong, Li, Minglin, Hu, Yinfan, Mei, Hao, Xing, Huifeng, Wang, Hao, Shi, Wanxin, Liu, Sen, Xu, Yang

arXiv.org Artificial IntelligenceJul-29-2024

Parameter Server (PS) and Ring-AllReduce (RAR) are two widely utilized synchronization architectures in multi-worker Deep Learning (DL), also referred to as Distributed Deep Learning (DDL). However, PS encounters challenges with the ``incast'' issue, while RAR struggles with problems caused by the long dependency chain. The emerging In-network Aggregation (INA) has been proposed to integrate with PS to mitigate its incast issue. However, such PS-based INA has poor incremental deployment abilities as it requires replacing all the switches to show significant performance improvement, which is not cost-effective. In this study, we present the incorporation of INA capabilities into RAR, called RAR with In-Network Aggregation (Rina), to tackle both the problems above. Rina features its agent-worker mechanism. When an INA-capable ToR switch is deployed, all workers in this rack run as one abstracted worker with the help of the agent, resulting in both excellent incremental deployment capabilities and better throughput. We conducted extensive testbed and simulation evaluations to substantiate the throughput advantages of Rina over existing DDL training synchronization structures. Compared with the state-of-the-art PS-based INA methods ATP, Rina can achieve more than 50\% throughput with the same hardware cost.

ina switch, rina, throughput, (15 more...)

arXiv.org Artificial Intelligence

2407.19721

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > Panama (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.84)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Communications > Networks (0.93)

Add feedback

Wide-AdGraph: Detecting Ad Trackers with a Wide Dependency Chain Graph

Kargaran, Amir Hossein, Akhondzadeh, Mohammad Sadegh, Heidarpour, Mohammad Reza, Manshaei, Mohammad Hossein, Salamatian, Kave, Sattary, Masoud Nejad

arXiv.org Machine LearningMay-10-2021

Websites use third-party ads and tracking services to deliver targeted ads and collect information about users that visit them. These services put users' privacy at risk, and that is why users' demand for blocking these services is growing. Most of the blocking solutions rely on crowd-sourced filter lists manually maintained by a large community of users. In this work, we seek to simplify the update of these filter lists by combining different websites through a large-scale graph connecting all resource requests made over a large set of sites. The features of this graph are extracted and used to train a machine learning algorithm with the aim of detecting ads and tracking resources. As our approach combines different information sources, it is more robust toward evasion techniques that use obfuscation or changing the usage patterns. We evaluate our work over the Alexa top-10K websites and find its accuracy to be 96.1% biased and 90.9% unbiased with high precision and recall. It can also block new ads and tracking services, which would necessitate being blocked by further crowd-sourced existing filter lists. Moreover, the approach followed in this paper sheds light on the ecosystem of third-party tracking and advertising.

adtracker, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1145/3447535.3462549

2004.14826

Genre: Research Report (0.40)

Industry:

Information Technology > Security & Privacy (1.00)
Marketing (0.88)

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)

Add feedback